Skip to content

Fix event loop blocking with async subprocess operations#3

Closed
Tyler-Rak wants to merge 4 commits intoreleasefrom
feature/async-subprocess-fix
Closed

Fix event loop blocking with async subprocess operations#3
Tyler-Rak wants to merge 4 commits intoreleasefrom
feature/async-subprocess-fix

Conversation

@Tyler-Rak
Copy link
Copy Markdown
Owner

Summary

Fixes pod restart issues caused by blocking subprocess operations during large PR reviews by converting git operations to non-blocking async subprocess calls.

Problem

  • Large PRs (like Tr/readme links The-PR-Agent/pr-agent#760 with 476 files) caused pod restarts due to liveness probe failures
  • Root cause: blocking subprocess.run() calls freeze event loop for 5-6+ seconds
  • Critical git operations: clone (~5-6s), fetch (~2-3s), show (~1-2s)

Solution

Converted blocking subprocess calls to asyncio.create_subprocess_exec():

  • _clone_inner() - git clone operation
  • _get_diff_files_with_git_clone() - git fetch operation
  • get_file_from_git() - git show operation

Changes

Core Implementation

  • Added async helper functions in git_provider.py:
    • run_subprocess_async() - non-blocking subprocess execution
    • run_subprocess_with_output() - non-blocking with output capture
  • Made clone() and get_diff_files() async in base class
  • Updated all 3 blocking git operations to use async helpers

Propagated Changes

  • Made methods async across tool files (pr_reviewer, pr_description, pr_code_suggestions, etc.)
  • Added await to all get_diff_files() calls
  • Made helper functions async (dedent_code, validate functions)

Test Results

All unit tests pass: 204 tests passed in 4.03s

Async functionality tests pass:

  • Basic async subprocess: 0.006s
  • Output capture: 0.008s
  • Timeout handling: Correct
  • Concurrent execution verified: 3 commands in 1.026s (would be 3s if blocking)
  • Real git operations: 0.019s

Merge Conflict Risk

  • bitbucket_server_git_provider.py: 0% risk (Rakuten-specific file)
  • git_provider.py: 30-40% risk (base class)
  • Tool files: 50-70% risk (actively developed)
  • Overall: Medium-High (50-60%) but EASY to resolve (systematic async/await additions)

Deployment Plan

  1. Deploy to staging environment
  2. Test with large PR (e.g., Tr/readme links The-PR-Agent/pr-agent#760)
  3. Monitor pod health (should not restart during processing)
  4. Verify PR review completes successfully

Documentation

  • docs/rakuten/ASYNC_BLOCKING_ANALYSIS.md - Problem analysis
  • docs/rakuten/SUBPROCESS_FIX_PLAN.md - Implementation plan

Testing

Local async tests verified:

  • Non-blocking subprocess execution works
  • Concurrent operations run in parallel
  • Event loop remains responsive
  • Git operations are properly async

Ready for deployment testing on k8s.

Move all custom documentation to docs/rakuten/ for clear separation
from upstream docs:
- Move architecture docs to docs/rakuten/architecture/
- Add CONCURRENCY_LIMITS.md with analysis of API rate limits and
  safe concurrency levels for Rakuten CaaS deployment
- Add README.md to explain custom vs upstream documentation

This organization makes it clear which docs are Rakuten-specific and
prevents merge conflicts with upstream changes.
@Tyler-Rak Tyler-Rak changed the base branch from main to release December 9, 2025 12:11
@Tyler-Rak
Copy link
Copy Markdown
Owner Author

Closing this PR to create a cleaner one with only the async subprocess changes (without the queue monitoring commits)

@Tyler-Rak Tyler-Rak closed this Dec 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant